Introduction


► Need actionable performance and troubleshooting data at 
interactive speeds. 

► /proc and lctl dk are useful, but:

► Performance issues at scale 

► Don’t want to pollute logs 

► Want more information 


SystemTap


► SystemTap consists of a scripting language, translator and 
runtime. 

► Provides system-wide tracing capabilities. 

► Kernel and userspace. 

► strace traces a process tree; SystemTap provides visibility 
across the entire system. 


Using SystemTap with Lustre 


► Where to probe? 

► This is the hard part - need to have some understanding of 
how Lustre works. 

► Extract data from functions as they are called/return. 

► Output as you go, or 

► Aggregate and periodically display 

► Timing function calls. 

► Lustre service threads handle RPCs from start to finish, one at 
a time 

► Makes it easy to store timing, other information based on the 
thread handling the request. 


Example: Timing ldiskfs block allocations


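global start
global times

# Record the entry time of each allocation, keyed by thread ID.
probe module("ldiskfs").function("ldiskfs_mb_new_blocks") {
    start[tid()] = gettimeofday_ms();
}

# On return, add the elapsed milliseconds to the aggregate.
probe module("ldiskfs").function("ldiskfs_mb_new_blocks").return {
    if ([tid()] in start) {
        times <<< gettimeofday_ms() - start[tid()];
        delete start[tid()];
    }
}

# Print a log2 histogram of the timings when the script exits.
probe end {
    print(@hist_log(times));
}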


Output 


value | count
    0 | 27083
    1 |@
    2 | 6
    4 | 2
    8 | 158
   16 | 74
   32 | 33
   64 | 10
  128 | 6
  256 | 2
  512 | 1
32768 | 1


The 32768 ms bucket: that’s between 32 and 64 seconds!
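
To see why a value in the 32768 bucket means 32 to 64 seconds, here is a minimal Python sketch of power-of-two bucketing. It assumes the histogram labels each bucket with its lower bound and that the bucket labeled 2^k holds values from 2^k up to 2^(k+1) - 1 milliseconds; the function names are illustrative and not part of SystemTap.

```python
from typing import Tuple


def log2_bucket(value_ms: int) -> int:
    """Return the lower bound of the power-of-two bucket holding value_ms."""
    if value_ms <= 0:
        return 0
    # bit_length() - 1 is floor(log2(value_ms)) for positive integers.
    return 1 << (value_ms.bit_length() - 1)


def bucket_range_seconds(bucket_ms: int) -> Tuple[float, float]:
    """Return the [low, high) range of a bucket, converted to seconds."""
    high_ms = 1 if bucket_ms == 0 else bucket_ms * 2
    return bucket_ms / 1000.0, high_ms / 1000.0


if __name__ == "__main__":
    low, high = bucket_range_seconds(32768)
    print(f"bucket 32768 ms covers {low:.1f} s to {high:.1f} s")
```

Under that assumption, a 40-second allocation (40000 ms) lands in the 32768 bucket, which is where the single outlier in the output sits.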


Examples 


► Poor choice of stripe count. 

► Fragmentation. 

► High OSS load average. 


Poor choice of stripe count 


► The default stripe count on our filesystems is one. 

► Average file size is relatively small, so this is OK. 

► Except...

► Large tar files can take up a significant portion of an OST. 

► Many ranks writing to a file on one OST can perform poorly. 


big-object 


► A SystemTap script intercepts calls in the write path on the 
OSS to gather the following information: 

► NID 

► OST name 

► object ID 

► FID 

► UID 

► object size 

► When there's a write to an object over a predetermined size,
print it.

► A Python wrapper gathers additional information about the 
object and writing process, including the path. 


big-object 


service162 ~ # big-object

Fri Mar 23 10:21:09 2012 service61-ib ost:nbp2-OST0021
stripes:1 pid:4320 command:tar size:506123MB
name:/nobackupp2/.../something.tar
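
A report line like the one above is easy to post-process. The following Python sketch is not the actual wrapper; it assumes the fields are whitespace-separated key:value pairs with no stray spaces, that sizes always end in MB, and that paths contain no spaces. The function names and the 100000 MB limit are made up for illustration.

```python
import re

# Keys start with a letter, so the HH:MM:SS timestamp is not mistaken for a field.
FIELD_RE = re.compile(r"\b([a-z_]+):(\S+)")

SAMPLE = (
    "Fri Mar 23 10:21:09 2012 service61-ib ost:nbp2-OST0021 "
    "stripes:1 pid:4320 command:tar size:506123MB "
    "name:/nobackupp2/.../something.tar"
)


def parse_big_object_line(line):
    """Turn one report line into a dict with typed fields."""
    fields = dict(FIELD_RE.findall(line))
    size_text = fields["size"]
    if not size_text.endswith("MB"):
        raise ValueError("unexpected size unit: " + size_text)
    return {
        "ost": fields["ost"],
        "stripes": int(fields["stripes"]),
        "pid": int(fields["pid"]),
        "command": fields["command"],
        "size_mb": int(size_text[:-2]),
        "name": fields["name"],
    }


def is_single_stripe_hog(record, limit_mb=100000):
    """Flag large objects that live entirely on one OST (hypothetical policy)."""
    return record["stripes"] == 1 and record["size_mb"] >= limit_mb


if __name__ == "__main__":
    record = parse_big_object_line(SAMPLE)
    print(record["name"], record["size_mb"], is_single_stripe_hog(record))
```

Keeping the parser separate from the policy check makes it simple to try different size limits per filesystem.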


Fragmentation 


► On-disk 

► Block allocator can't find a large enough chunk of contiguous 
free space. 

► Delays writes; fragmented allocation will cause more I/Os for 
both reads and writes. 

► Memory 

► The IB SRP driver can only handle scatter-gather descriptors 
up to length 255. 


Showing I/O fragmentation in real-time 


► Use SystemTap to hook into Lustre I/O path. 

► A good I/O - 1MB or more in a single write: 

nid:10.151.18.95@o2ib0 ost:nbp2-OST0010 uid:0
mdt_inode:0 sizes: 256

► Memory fragmentation causing SRP to issue multiple I/Os: 

nid:10.151.14.211@o2ib0 ost:nbp2-OST0008 uid:0 mdt_inode
sizes: 255 (255) 1
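
The 255 + 1 split follows from the descriptor limit. Here is a minimal Python sketch of the arithmetic, assuming the sizes are counted in pages, that a 1 MB write is 256 pages of 4 KiB, and that fully fragmented memory needs one scatter-gather entry per page. The function name is hypothetical; the real splitting happens in the kernel block layer and driver.

```python
def split_by_sg_limit(num_segments, max_sg_entries=255):
    """Split an I/O with num_segments discontiguous segments into requests
    that each fit within max_sg_entries scatter-gather entries."""
    if num_segments < 0:
        raise ValueError("num_segments must be non-negative")
    if max_sg_entries <= 0:
        raise ValueError("max_sg_entries must be positive")
    sizes = []
    remaining = num_segments
    while remaining > 0:
        chunk = min(remaining, max_sg_entries)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    # 256 discontiguous pages do not fit in one 255-entry request.
    print(split_by_sg_limit(256))
```

With 256 segments the sketch yields one request of 255 and one of 1, so a single 1 MB write turns into two I/Os.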


On-disk fragmentation 


nid:


High OSS load average


► High OSS load average is usually due to long disk queues. 

► A typical cause is many hosts performing I/O to a file on a 
small number of OSTs. 

► You could mine the data in /proc. 

► On a large system, this takes time. 

► I want to know which file is being accessed. 

► Currently possible for writes. 

► May require modifications to Lustre for reads. 
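
Once write events carry the client NID, OST, and object, finding the file behind a long disk queue is a grouping problem. The Python sketch below shows one way to do it; the event tuples, NIDs, and object IDs are invented for illustration and do not come from a real system or from the actual tools.

```python
from collections import defaultdict


def busiest_objects(events, min_clients):
    """Group write events by (ost, object_id) and return the objects written
    by at least min_clients distinct NIDs, busiest first."""
    clients = defaultdict(set)
    for nid, ost, object_id in events:
        clients[(ost, object_id)].add(nid)
    hot = [(key, len(nids)) for key, nids in clients.items()
           if len(nids) >= min_clients]
    # Sort by descending client count, then by key for a stable order.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return hot


if __name__ == "__main__":
    # Hypothetical events: (client NID, OST name, object ID).
    sample_events = [
        ("10.0.0.1@o2ib0", "OST-A", 101),
        ("10.0.0.2@o2ib0", "OST-A", 101),
        ("10.0.0.3@o2ib0", "OST-A", 101),
        ("10.0.0.1@o2ib0", "OST-A", 101),
        ("10.0.0.4@o2ib0", "OST-B", 202),
    ]
    for (ost, object_id), count in busiest_objects(sample_events, 2):
        print(ost, object_id, count)
```

Counting distinct NIDs rather than raw events keeps one busy client from looking like many hosts hitting the same object.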


oststat 


OST           r/s   w/s  aveq
nbp2-OST06     80    42     4
nbp2-OST0e     37     2     0
nbp2-OST16     50     0     1
nbp2-OST1e     24     0     1
nbp2-OST26     44    18     3
nbp2-OST2e     79     9     1
nbp2-OST36     41         135
nbp2-OST3e     80     2     3
nbp2-OST46     63     2     3
nbp2-OST4e     73     2     3
               43    13     0
               35     1     0
               67     1     2
              104     8     3


wwat 


[wwat sample output; the table did not survive extraction. One recoverable entry: job 66348.pbspl1]


Future work


► Started the NASA open-source process. Distribution will 
include: 

► Lustre tapset library 

► big-object and oststat 

► Mechanism for mapping hosts to your site’s batch system 

► More tools! 

► Send me your ideas. 

► Better yet, patches :-) 


► Visualization 


Questions? 


